Neural architecture search using a performance prediction neural network

ABSTRACT

A method for determining an architecture for a task neural network configured to perform a particular machine learning task is described. The method includes obtaining data specifying a current set of candidate architectures for the task neural network; for each candidate architecture in the current set: processing the data specifying the candidate architecture using a performance prediction neural network having multiple performance prediction parameters, the performance prediction neural network being configured to process the data specifying the candidate architecture in accordance with current values of the performance prediction parameters to generate a performance prediction that characterizes how well a neural network having the candidate architecture would perform after being trained on the particular machine learning task; and generating an updated set of candidate architectures by selecting one or more of the candidate architectures in the current set based on the performance predictions for the candidate architectures in the current set.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 16/861,491, filed Apr. 29, 2020, which is a continuation of International Application No. PCT/US2018/063293, filed Nov. 30, 2018, which is a non-provisional of and claims priority to U.S. Provisional Patent Application No. 62/593,213, filed on Nov. 30, 2017, the entire contents of which are hereby incorporated by reference.

BACKGROUND

This specification relates to determining architectures for neural networks.

Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters.

Some neural networks are recurrent neural networks. A recurrent neural network is a neural network that receives an input sequence and generates an output sequence from the input sequence. In particular, a recurrent neural network can use some or all of the internal state of the network from a previous time step in computing an output at a current time step. An example of a recurrent neural network is a long short-term memory (LSTM) neural network that includes one or more LSTM memory blocks. Each LSTM memory block can include one or more cells that each include an input gate, a forget gate, and an output gate that allow the cell to store previous states for the cell, e.g., for use in generating a current activation or to be provided to other components of the LSTM neural network.

SUMMARY

This specification describes a system implemented as computer programs on one or more computers in one or more locations that determines a network architecture for a task neural network that is configured to perform a particular machine learning task.

The subject matter described in this specification can be implemented in particular embodiments so as to realize one or more of the following advantages. By determining the architecture of a task neural network using the techniques described in this specification, the system can determine a network architecture that achieves or even exceeds state-of-the-art performance on any of a variety of machine learning tasks, e.g., image classification or another image processing task. Additionally, the system can determine the architecture of the task neural network (for example, determining an output cell that is repeated throughout the architecture of the task neural network) in a specific manner that is much more computationally efficient than existing techniques, i.e., that consumes many fewer computational resources than existing techniques. In particular, many existing techniques rely on evaluating the performance of a large number of candidate architectures by training a network having the candidate architecture. This training is both time consuming and computationally intensive. The described techniques greatly reduce the number of instances of the task neural network that need to be trained by instead employing a performance prediction neural network that effectively predicts the performance of a trained network having a candidate architecture, i.e., without needing to actually train a network having the candidate architecture. In some described implementations, this approach is combined with other resource-conserving approaches, i.e., techniques that effectively limit the search space of possible architectures of the final output architecture without adversely affecting and, in some cases, even improving the performance of the resulting task neural network that includes multiple instances of the output architecture, to achieve even greater computational efficiency. For example, other resource-conserving approaches may include learning the architecture of a convolutional cell or other type of cell that includes multiple blocks of operations, and then repeating the learned cell according to a pre-determined template to generate the architecture of the task neural network.

The details of one or more embodiments of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows an architecture of an example neural architecture search (NAS) system.

FIG. 2 illustrates an architecture of an example cell of a task neural network.

FIG. 3 shows an architecture of an example task neural network.

FIG. 4 is a flow diagram of an example process for determining the architecture of an output cell.

Like reference numbers and designations in the various drawings indicate like elements.

DETAILED DESCRIPTION

This specification describes a neural architecture search system implemented as computer programs on one or more computers in one or more locations that determines a network architecture for a task neural network. The task neural network is configured to perform a particular machine learning task.

In general, the task neural network is configured to receive a network input and to process the network input to generate a network output for the input.

In some cases, the task neural network is a convolutional neural network that is configured to receive an input image and to process the input image to generate a network output for the input image, i.e., to perform some kind of image processing task.

For example, the task may be image classification and the output generated by the neural network for a given image may be scores for each of a set of object categories, with each score representing an estimated likelihood that the image contains an image of an object belonging to the category.

As another example, the task can be image embedding generation and the output generated by the neural network can be a numeric embedding of the input image.

As yet another example, the task can be object detection and the output generated by the neural network can identify locations in the input image at which particular types of objects are depicted.

In some other cases, the task can be video classification and the task neural network is configured to receive as input a video or a portion of a video and to generate an output that determines what topic or topics the input video or video portion relates to.

In some other cases, the task can be speech recognition and the task neural network is configured to receive as input audio data and to generate an output that determines, for a given spoken utterance, the term or terms that the utterance represents.

In some other cases, the task can be text classification and the task neural network is configured to receive an input text segment and to generate an output that determines what topic or topics the input text segment relates to.

FIG. 1 shows an example neural architecture search (NAS) system 100. The neural architecture search system 100 is an example of a system implemented as computer programs on one or more computers in one or more locations, in which the systems, components, and techniques described below can be implemented.

In some implementations, the NAS system 100 is configured to determine a network architecture for a task neural network by determining an architecture for an output cell 150 that is repeated throughout the network architecture. That is, the task neural network includes multiple instances of the output cell 150. The number of filters of convolutional operations within the instances of the output cell 150 may differ based on a position of the instances within the task neural network. In some cases, the task neural network includes a stack of multiple instances of the output cell 150. In some cases, in addition to the stack of output cells, the task neural network includes one or more other neural network layers, e.g., an output layer and/or one or more other types of layers. For example, the task neural network may include a convolutional neural network layer followed by a stack of multiple instances of the output cell followed by a global pooling neural network layer followed by a softmax classification neural network layer. An example architecture of a task neural network is described in more detail below with reference to FIG. 3.

Generally, a cell is a fully convolutional neural network that is configured to receive a cell input and to generate a cell output. In some implementations, the cell output may have a same dimension as the cell input, e.g., the same height (H), width (W), and depth (F). For example, a cell may receive a feature map as input and generate an output feature map having the same dimension as the input feature map. In some other implementations, the cell output may have a dimension different from the dimension of the cell input. For example, when the cell is a fully convolutional neural network with stride 2, given that the cell input is a H×W×F tensor, the cell output can be a H′×W′×F′ tensor, where H′=H/2, W′=W/2, and F′=2F.

In some cases, a cell includes B operation blocks, where B is a predetermined positive integer. For example, B can be three, five, or ten. Each operation block in the cell receives one or more respective input hidden states, and applies one or more operations on the input hidden states to generate a respective output hidden state.

In some implementations, each of the B operation blocks is configured to apply a first operation to a first input hidden state to the operation block to generate a first output. The operation block is configured to apply a second operation to a second input hidden state to the operation block to generate a second output. The operation block is then configured to apply a combining operation to the first and second outputs to generate an output hidden state for the operation block. The first input hidden state, the second input hidden state, the first operation, the second operation, and the combining operation can be defined by a set of hyperparameters associated with the operation block. For instance, the set of hyperparameters corresponding to the operation block includes the following hyperparameters: a first hyperparameter representing which hidden state is used as the first input hidden state, a second hyperparameter representing which hidden state is used as the second input hidden state, a third hyperparameter representing which operation is used as the first operation, a fourth hyperparameter representing which operation is used as the second operation, and a fifth hyperparameter representing which operation is used as the combining operation to combine the outputs of the first operation and the second operation.
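As an illustration only, the five hyperparameters of an operation block can be held in a small record type. The sketch below is not part of the described system: the names `BlockSpec`, `OPERATIONS`, and `COMBINERS`, and the particular operation names, are hypothetical and chosen only to make the structure concrete.

```python
from dataclasses import dataclass

# Hypothetical operation and combiner names, used for illustration only.
OPERATIONS = ("identity", "sep_conv_3x3", "max_pool_3x3")
COMBINERS = ("add", "concat")


@dataclass(frozen=True)
class BlockSpec:
    """The five hyperparameters that define one operation block."""

    input_1: int   # index of the hidden state used as the first input
    input_2: int   # index of the hidden state used as the second input
    op_1: str      # operation applied to the first input
    op_2: str      # operation applied to the second input
    combiner: str  # operation that combines the two outputs

    def __post_init__(self):
        # Reject names outside the (assumed) operation and combiner sets.
        if self.op_1 not in OPERATIONS or self.op_2 not in OPERATIONS:
            raise ValueError("unknown operation")
        if self.combiner not in COMBINERS:
            raise ValueError("unknown combining operation")


# A cell is then an ordered tuple of block specifications.
cell = (BlockSpec(0, 1, "identity", "sep_conv_3x3", "add"),)
```

Making the record immutable lets candidate cells be compared or stored in sets during a search.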

An example architecture of a cell is described in more detail below with reference to FIG. 2.

To determine the architecture of the output cell 150, the NAS system 100 includes a performance prediction neural network 110 (also referred to as “the predictor 110”) that has a plurality of performance prediction parameters (also referred to in this specification as “prediction parameters”). The predictor 110 is a recurrent neural network that includes one or more recurrent neural network layers. For example, the predictor 110 can be a long short-term memory (LSTM) neural network or a gated recurrent unit (GRU) neural network.

Generally, the predictor 110 is configured to receive data specifying a candidate cell and to process the data in accordance with the prediction parameters to generate a performance prediction that characterizes how well a neural network having the candidate cell would perform after being trained on the particular machine learning task. The data specifying the candidate cell is a sequence of embeddings that define the candidate cell (e.g., embeddings of multiple sets of hyperparameters with each set of hyperparameters defining a respective operation block included in the candidate cell). An embedding as used in this specification is a numeric representation of a hyperparameter, e.g., a vector or other ordered collection of numeric values. The embeddings can be pre-determined or learned as part of training the predictor.
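One possible way to turn a candidate cell into such a sequence of embeddings is sketched below. This is a sketch under stated assumptions: each block is a plain 5-tuple (I₁, I₂, O₁, O₂, C), the token format and the helper names `cell_to_tokens`, `build_embedding_table`, and `embed_cell` are hypothetical, and the embeddings are fixed random vectors rather than learned ones.

```python
import random


def cell_to_tokens(cell):
    """Flatten a cell (a list of 5-tuples) into one token per hyperparameter."""
    tokens = []
    for i1, i2, o1, o2, c in cell:
        tokens.extend(
            [f"in:{i1}", f"in:{i2}", f"op:{o1}", f"op:{o2}", f"comb:{c}"]
        )
    return tokens


def build_embedding_table(vocab, dim, seed=0):
    """Assign each token a fixed random vector (a stand-in for learned embeddings)."""
    rng = random.Random(seed)
    return {tok: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for tok in vocab}


def embed_cell(cell, table):
    """Return the sequence of embeddings that would be fed to the predictor."""
    return [table[tok] for tok in cell_to_tokens(cell)]
```

A cell with b blocks yields a sequence of 5b vectors, which a recurrent predictor can consume one step at a time.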

The performance prediction can be, for example, a prediction of the accuracy of the trained neural network. As another example, the performance prediction can include both a predicted mean accuracy and a predicted standard deviation or variance for the accuracy.

In particular, as part of determining the architecture for the output cell 150 that is repeated throughout the network architecture of the task neural network, the NAS system 100 obtains data 102 that specifies a current set of candidate cells for the output cell 150. In some cases, the current set of candidate cells is an initial set of candidate cells. In some other cases, the NAS system 100 obtains cells from a previous iteration and then generates the current set of candidate cells by expanding each of the previous cells, e.g., by adding a respective one or more operation blocks to each of the previous cells.

For each of the candidate cells in the current set, the predictor 110 receives data specifying the candidate cell and processes the data in accordance with current values of the performance prediction parameters to generate a performance prediction for the candidate cell.

The NAS system 100 then generates an updated set of candidate cells 112 by selecting one or more of the candidate cells in the current set based on the performance predictions for the candidate cells in the current set. That is, the NAS system 100 prunes the current set based on the predictions generated by the performance prediction neural network 110 to generate the updated set. For example, the NAS system 100 selects, from the current set, K candidate cells that have the best performance predictions to include in the updated set 112, where K is a predetermined integer.
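The pruning step can be sketched as a top-K selection over predicted scores. The function name `select_top_k` is hypothetical, and the sketch assumes that a higher prediction means better expected performance.

```python
import heapq


def select_top_k(candidates, predictions, k):
    """Keep the k candidates with the highest predicted performance."""
    if len(candidates) != len(predictions):
        raise ValueError("one prediction is required per candidate")
    # Pair each score with its index so candidates need not be comparable.
    ranked = heapq.nlargest(k, zip(predictions, range(len(candidates))))
    return [candidates[i] for _, i in ranked]
```

For example, with candidates `["a", "b", "c", "d"]`, predictions `[0.2, 0.9, 0.5, 0.7]`, and `k=2`, the function keeps `"b"` and `"d"`.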

To update the values of the performance prediction parameters of the predictor 110, the NAS system 100 includes a training engine 120 and a prediction parameter updating engine 130. Generally, the training engine 120 and the prediction parameter updating engine 130 will be implemented as one or more software modules or components, installed on one or more computers in one or more locations. In some cases, one or more computers will be dedicated to a particular engine; in other cases, multiple engines can be installed and running on the same computer or computers.

For each candidate cell in the updated set, the training engine 120 is configured to generate an instance of the task neural network having the candidate cell and to train the instance to perform the particular machine learning task. For example, the training engine 120 generates the instance of the task neural network according to a predetermined template architecture of the task neural network. For instance, the template architecture of the task neural network includes a first neural network layer (e.g., a convolutional layer) followed by a stack of N instances of a cell followed by an output subnetwork (e.g., an output subnetwork that includes a softmax neural network layer).

To train instances of the task neural network, the training engine 120 obtains training data for training the instances on the particular machine learning task and a validation set for evaluating the performance of the trained instances of the task neural network on the particular machine learning task.

The training engine 120 can receive the data for training the instances in any of a variety of ways. For example, in some implementations, the training engine 120 receives training data as an upload from a remote user of the NAS system 100 over a data communication network, e.g., using an application programming interface (API) made available by the NAS system 100.

The training engine 120 evaluates a performance of each trained instance on the particular machine learning task to determine an actual performance 122 of the trained instance. For example, the actual performance can be an accuracy of the trained instance on the validation set as measured by an appropriate accuracy measure. For example, the accuracy can be a classification error rate when the task is a classification task or an intersection over union difference measure when the task is a regression task. As another example, the actual performance can be an average or a maximum of the accuracies of the instance for each of the last two, five, or ten epochs of the training of the instance.
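The last option, summarizing the final few epochs, can be sketched as follows. The function name `actual_performance` and its parameters are hypothetical; the sketch assumes one validation accuracy is recorded per training epoch.

```python
def actual_performance(epoch_accuracies, last_n=5, reduce="mean"):
    """Summarize the validation accuracies of the last `last_n` epochs."""
    if last_n <= 0:
        raise ValueError("last_n must be positive")
    tail = epoch_accuracies[-last_n:]
    if not tail:
        raise ValueError("no epoch accuracies were recorded")
    if reduce == "mean":
        return sum(tail) / len(tail)
    if reduce == "max":
        return max(tail)
    raise ValueError("reduce must be 'mean' or 'max'")
```

Averaging over several late epochs reduces the influence of epoch-to-epoch noise on the target that the predictor is later trained to match.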

The prediction parameter updating engine 130 uses the actual performances for the trained instances to adjust the values of the performance prediction parameters of the performance prediction neural network 110. In particular, the prediction parameter updating engine 130 adjusts the values of the prediction parameters by training the predictor 110 to accurately predict the actual performance of candidate cells using a conventional supervised learning technique, for example, a stochastic gradient descent (SGD) technique.
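The supervised update can be illustrated with a deliberately simplified stand-in. The sketch below fits a linear model, not the recurrent predictor described above, to (cell features, actual performance) pairs using plain SGD on a squared-error loss. The names `predict_linear` and `sgd_fit`, the fixed-length feature vectors, the learning rate, and the loss are all assumptions made for illustration.

```python
def predict_linear(w, b, x):
    """Predicted performance of a cell described by feature vector x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def sgd_fit(features, targets, lr=0.1, epochs=500):
    """Fit weights by SGD on squared error between predicted and actual performance."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, targets):
            err = predict_linear(w, b, x) - y  # signed prediction error
            # Gradient step for 0.5 * err**2 with respect to w and b.
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b
```

In the described system the same idea applies to the recurrent predictor: the trained instances supply the regression targets, and the prediction parameters are adjusted to reduce the gap between predicted and actual performance.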

By using the predictor 110 to generate a performance prediction for each of the candidate cells in the current set, the NAS system 100 considers all candidate cells in the current set. However, the NAS system 100 only needs to actually train a small number of the candidate cells, i.e., those candidate cells that were selected based on the performance predictions generated by the predictor 110 for inclusion in the updated set. Therefore, the NAS system 100 defines a specific technical implementation which is more computationally efficient (i.e., consumes many fewer computational resources) than existing systems that rely on evaluating the performance of a large number of candidate cells by actually training a network having the candidate cell. This is because training the instances of the task neural network is much more computationally expensive than just predicting their actual performances using the predictor 110. Moreover, in some implementations, the candidate cells selected by the predictor for inclusion in the updated set may be trained and evaluated in parallel, thus allowing the NAS system 100 to determine the output cell faster than traditional systems.

After updating the prediction parameters of the predictor 110, the NAS system 100 expands the candidate cells in the updated set to generate a new set that includes multiple new candidate cells. In particular, the NAS system 100 expands the candidate cells in the updated set by adding, for each of the candidate cells in the updated set, a respective new operation block having a respective set of hyperparameters to the candidate cell.

Generally, given that the updated set has N candidate cells, each having b operation blocks, the NAS system 100 generates, for each particular candidate cell in the updated set, a subset of all possible cells with each possible cell having b+1 operation blocks (i.e., by adding a new (b+1)^(th) operation block to the particular candidate cell). The new set is the combination of the subsets of all possible cells having b+1 operation blocks.

In some implementations, the new (b+1)^(th) operation block can be specified by 5 hyperparameters, $(I_1, I_2, O_1, O_2, C)$, where $I_1, I_2 \in \mathcal{I}_{b+1}$ specify the inputs to the new operation block and $\mathcal{I}_{b+1}$ is the set of possible inputs to the new operation block; $O_1, O_2 \in \mathcal{O}$ specify the operations to apply to inputs $I_1$ and $I_2$, respectively, where $\mathcal{O}$ is a predetermined operation space; and $C \in \mathcal{C}$ specifies how to combine $O_1$ and $O_2$ to generate a block output $H_{b+1}^{c}$ for the new operation block, where $\mathcal{C}$ is the set of possible combination operators.

In these implementations, the search space of possible structures for the (b+1)^(th) operation block is $B_{b+1}$, which has size

$$|B_{b+1}| = |\mathcal{I}_{b+1}|^2 \times |\mathcal{O}|^2 \times |\mathcal{C}|,$$

where $|\mathcal{I}_{b+1}| = 2 + (b+1) - 1$, $|\mathcal{O}|$ is the number of operations in the operation space, and $|\mathcal{C}|$ is the number of combination operators in the set $\mathcal{C}$. Therefore, the number of candidate cells in the new set is $N \times |B_{b+1}|$.
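The size of the per-block search space and the expansion of a single cell can be sketched as follows. The sketch assumes that each block has two inputs, two operations, and one combination operator, so that the number of possible new blocks is the squared number of inputs times the squared number of operations times the number of combiners. It also assumes that a cell is a tuple of 5-tuples and that inputs are identified by integer index. The function names are hypothetical.

```python
from itertools import product


def num_inputs(block_index):
    """Possible inputs to block `block_index` (1-based):
    two preceding cell outputs plus the block_index - 1 earlier blocks."""
    return 2 + block_index - 1


def block_space_size(block_index, num_ops, num_combiners):
    """Number of distinct blocks that can be placed at position `block_index`."""
    return num_inputs(block_index) ** 2 * num_ops ** 2 * num_combiners


def expand_cell(cell, ops, combiners):
    """Return every cell obtained by appending one new block to `cell`."""
    inputs = range(num_inputs(len(cell) + 1))
    return [
        cell + ((i1, i2, o1, o2, c),)
        for i1, i2, o1, o2, c in product(inputs, inputs, ops, ops, combiners)
    ]
```

With 8 operations and a single combiner, the first block has 2² × 8² × 1 = 256 possible structures and the second has 3² × 8² × 1 = 576, which shows how quickly the candidate set grows and why pruning with a predictor is useful.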

The NAS system 100 then sets the new set of candidate cells as the current set of candidate cells and repeats the above process until the candidate cells have a predetermined maximum number of operation blocks.

When the number of operation blocks in each of the candidate cells is equal to the predetermined maximum number of operation blocks, the NAS system 100 selects the candidate cell corresponding to the trained instance that has the best actual performance as the output cell 150 for the task neural network.

In some implementations, the system 100 provides data specifying the architecture of the output cell, e.g., to a user device over a network, once the output cell 150 has been determined. Instead of or in addition to providing the data specifying the architecture, the system 100 trains a neural network having the determined output cell 150, e.g., either from scratch or to fine-tune the parameter values generated as a result of training a larger neural network, and then uses the trained neural network to process requests received from users, e.g., through the API provided by the NAS system 100.

While this specification describes searching the space of possible architectures for a cell that is repeated multiple times throughout the task neural network, in some other implementations, the NAS system 100 searches for a portion of the architecture that is not repeated, e.g., through possible architectures for the entire task neural network other than one or more predetermined output layers and, optionally, one or more predetermined input layers.

FIG. 2 illustrates an architecture of an example cell 200 that can be used to construct a task neural network.

The cell 200 is a fully convolutional neural network that is configured to process a cell input (e.g., a H×W×F tensor) to generate a cell output (e.g., a H′×W′×F′ tensor).

In some implementations, for example when the cell 200 is a fully convolutional neural network with stride 1, the cell output may have a same dimension as the cell input (e.g., H′=H, W′=W, and F′=F). In some other implementations, the cell output may have a dimension different from the dimension of the cell input. For example, when the cell is a fully convolutional neural network with stride 2, given that the cell input is a H×W×F tensor, the cell output can be a H′×W′×F′ tensor, where H′=H/2, W′=W/2, and F′=2F.
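The two shape rules above can be written as a small helper. The function name `cell_output_shape` is hypothetical, and the sketch assumes that H and W are even when the stride is 2 so that the halved dimensions are integers.

```python
def cell_output_shape(h, w, f, stride):
    """Output (H', W', F') of a cell given its input (H, W, F) and stride."""
    if stride == 1:
        # Stride-1 cells preserve height, width, and depth.
        return (h, w, f)
    if stride == 2:
        if h % 2 or w % 2:
            raise ValueError("this sketch assumes even H and W for stride 2")
        # Stride-2 cells halve the spatial size and double the depth.
        return (h // 2, w // 2, 2 * f)
    raise ValueError("only stride 1 and stride 2 are covered here")
```

For a 32×32×16 input, a stride-1 cell yields 32×32×16 and a stride-2 cell yields 16×16×32.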

The cell 200 includes a plurality of operation blocks (B blocks). For example, as shown in FIG. 2, the cell 200 includes 5 blocks: blocks 202, 204, 206, 208, and 210.

Each block b in the cell 200 can be specified by 5 hyperparameters, $(I_1, I_2, O_1, O_2, C)$, where $I_1, I_2 \in \mathcal{I}_b$ specify the inputs to block b; $O_1, O_2 \in \mathcal{O}$ specify the operations to apply to inputs $I_1$ and $I_2$, respectively, where $\mathcal{O}$ is an operation space; and $C \in \mathcal{C}$ specifies how to combine $O_1$ and $O_2$ to generate a block output $H_b^{c}$ for block b, where $\mathcal{C}$ is the set of possible combination operators.

The set of possible inputs, $\mathcal{I}_b$, is the set of outputs of all previous blocks in the cell 200, $\{H_1^{c}, \ldots, H_{b-1}^{c}\}$, plus the output of the previous cell, $H_B^{c-1}$, plus the output of the cell preceding the previous cell, $H_B^{c-2}$.

The operation space $\mathcal{O}$ may include, but is not limited to, the following operations: 3×3 depthwise-separable convolution, 5×5 depthwise-separable convolution, 7×7 depthwise-separable convolution, 1×7 followed by 7×1 convolution, identity, 3×3 average pooling, 3×3 max pooling, and 3×3 dilated convolution.

In some implementations, the set of possible combination operators $\mathcal{C}$ includes an addition operation and a concatenation operation.

In some implementations, the set of possible combination operators $\mathcal{C}$ includes only an addition operation. In these implementations, each block b of the cell 200 can be specified by 4 hyperparameters $(I_1, I_2, O_1, O_2)$.

After each block b generates a block output, the block outputs of all blocks are combined, e.g., concatenated, summed, or averaged, to generate a cell output H^(c) for the cell 200.

FIG. 3 shows an architecture of an example task neural network 300. The task neural network 300 is configured to receive a network input 302 and to generate a network output 320 for the input 302.

The task neural network 300 includes a stack of cell instances 306. The stack 306 includes multiple instances of a cell that are stacked one after the other. The cell instances in the stack 306 may have the same structure but different parameter values. The number of filters of convolutional operations within the cell instances in the stack 306 may differ based on a position of the instances within the stack. For example, in one implementation, the cell instance 308 is a stride-2 cell and the cell instance 310 is a stride-1 cell. In such an implementation, the cell instance 308 has twice as many filters as the cell instance 310 has.

The first cell instance 308 in the stack 306 is configured to receive a first cell input and to process the first cell input to generate a first cell output.

In some cases, the first cell input is a network input 302 of the task neural network.

In some other cases, the network input 302 is an image and the task neural network 300 may include a convolutional neural network layer 304 preceding the stack of cells 306 in order to reduce computational costs associated with processing the image. For example, the convolutional neural network layer 304 is a 3×3 convolutional filter layer with stride 2. In these cases, the convolutional neural network layer 304 is configured to process the network input 302 to generate an intermediate output to be provided as the first cell input to the cell instance 308.

Each cell instance following the first cell instance (e.g., cell instances 310-312) is configured to receive as input the cell output of the previous cell instance and to generate a respective cell output that is fed as input to the next cell instance. The output of the stack 306 is the cell output of the last cell instance 314.

The task neural network 300 includes a sub-network 316 following the stack of cell instances 306. The sub-network 316 is configured to receive as input the output of the stack of cell instances 306 and to process the output of the stack 306 to generate the network output 320. As an example, the sub-network 316 includes a global pooling neural network layer followed by a softmax classification neural network layer.
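The overall layout of FIG. 3 can be sketched as an ordered list of layer descriptors. This is only a structural illustration: the function name `build_template`, the descriptor format, and the `stride2_positions` parameter are hypothetical, and no actual layers are constructed.

```python
def build_template(cell, num_cells, stride2_positions=()):
    """Describe a task network: stem convolution, stacked cells, output sub-network."""
    # Stem: a 3x3 convolution with stride 2 that precedes the stack.
    layers = [("conv3x3", {"stride": 2})]
    for i in range(num_cells):
        # Every instance shares the same cell structure; only the stride varies.
        stride = 2 if i in stride2_positions else 1
        layers.append(("cell", {"spec": cell, "stride": stride}))
    # Output sub-network: global pooling followed by softmax classification.
    layers.append(("global_pool", {}))
    layers.append(("softmax", {}))
    return layers
```

Because every cell instance refers to the same specification, searching over one cell determines the structure of the whole stack.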

FIG. 4 is a flow diagram of an example process 400 for determining the architecture of a cell that is repeated throughout a task neural network. For convenience, the process 400 will be described as being performed by a system of one or more computers located in one or more locations. For example, a neural architecture search system, e.g., the neural architecture search system 100 of FIG. 1, appropriately programmed in accordance with this specification, can perform the process 400.

The system obtains data specifying a current set of candidate cells for an output cell that is used to construct the task neural network (step 402).

In some cases, the current set of candidate cells is an initial set of candidate cells. In some other cases, the system obtains cells from the previous iteration and then generates the current set of candidate cells by expanding each of the previous cells, e.g., by adding a respective one or more operation blocks to each of the previous cells.

The system processes the data specifying the candidate cell using aperformance prediction neural network having a plurality of performanceprediction parameters (step 404). The performance prediction neuralnetwork is configured to process the data specifying the candidate cellin accordance with current values of the performance predictionparameters to generate a performance prediction that characterizes howwell a neural network having the candidate cell would perform afterbeing trained on the particular machine learning task.

The system generates an updated set of candidate cells by selecting oneor more of the candidate cells in the current set based on theperformance predictions for the candidate cells in the current set (step406). That is, the system prunes the current set based on thepredictions generated by the performance prediction neural network togenerate the updated set. For example, the system selects, from thecurrent set, K candidate cells that have the best performancepredictions to include in the updated set, where K is a predeterminedinteger.

The system iteratively performs steps 408-412 for each of the candidatecells in the current set as follows.

The system generates an instance of the task neural network having thecandidate cell (step 408). For example, the system generates theinstance of the task neural network according to a predeterminedtemplate architecture of the task neural network. For instance, thetemplate architecture of the task neural network includes a first neuralnetwork layer (e.g., a convolutional layer) followed by a stack of Ninstances of a cell followed by an output subnetwork (e.g., an outputsubnetwork that includes a softmax neural network layer).

The system trains the instance to perform the particular machinelearning task (step 410).

To train instances of the task neural network, the system obtainstraining data for training the instances on the particular machinelearning task and a validation set for evaluating the performance of thetrained instances of the task neural network on the particular machinelearning task. The system then trains the instance on the training datausing conventional machine learning training techniques.

The system then evaluates the performance of each trained instance onthe particular machine learning task to determine an actual performanceof the trained instance, e.g., by measuring the accuracy of the trainedinstance on the validation data set (step 412).

Once the system has repeated steps 408-412 for all candidate cells inthe current set, the system uses the actual performances for the trainedinstances to adjust the values of the performance prediction parametersof the performance prediction neural network (step 414).

In particular, the system adjusts the values of the predictionparameters by training the performance prediction neural network toaccurately predict the actual performance of candidate cells using aconventional supervised learning technique, for example, a stochasticgradient descent (SGD) technique.

The system then determines whether the number of operation blocks ineach of the candidate cells in the updated set is less than apredetermined maximum number of operation blocks allowed in a cell (step416).

When the number of operation blocks in each of the new candidate cellsin the new set is less than the predetermined maximum number ofoperation blocks allowed in a cell, the system expands the candidatecells in the updated set to generate a new set of candidate cells. Inparticular, the system expands the candidate cells in the updated set byadding, for each of candidate cells in the updated set, a respective newoperational block having a respective set of hyperparameters to thecandidate cell. The system then sets this new set of candidate cells asthe current set of candidate cells and repeats steps 402-416 until thenumber of operation blocks in each candidate cell is equal to themaximum number of operation blocks.

When the number of operation blocks in each of the candidate cells in the updated set is equal to the predetermined maximum number of operation blocks, the system selects the candidate cell corresponding to the trained instance that has the best actual performance as the output cell that is repeated throughout the architecture of the task neural network (step 418).
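Taken together, the predict, select, train, update, and expand steps described above form a loop that can be sketched as follows. This is a simplified illustration under stated assumptions, not a definitive implementation: the function `progressive_search`, its callable arguments, the parameter `beam_width`, and the toy scoring rule are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical cell encoding: a tuple of block names.
Cell = Tuple[str, ...]


def progressive_search(
    initial_cells: Sequence[Cell],
    predict: Callable[[Cell], float],
    train_and_evaluate: Callable[[Cell], float],
    update_predictor: Callable[[Dict[Cell, float]], None],
    expand: Callable[[Cell], List[Cell]],
    beam_width: int,
    max_blocks: int,
) -> Cell:
    """Grow cells block by block, training only the best-predicted ones."""
    current = list(initial_cells)
    while True:
        # Keep the cells with the highest predicted performance.
        ranked = sorted(current, key=predict, reverse=True)
        updated = ranked[:beam_width]
        # Train an instance per selected cell and record actual performance.
        actual = {cell: train_and_evaluate(cell) for cell in updated}
        # Use the actual performances to adjust the predictor.
        update_predictor(actual)
        if all(len(cell) >= max_blocks for cell in updated):
            # Cells are full size: return the best actual performer.
            return max(actual, key=lambda cell: actual[cell])
        current = [new for cell in updated for new in expand(cell)]


MAX_BLOCKS = 3
history: List[Dict[Cell, float]] = []
best_cell = progressive_search(
    initial_cells=[("a",), ("b",)],
    predict=lambda cell: float(cell.count("a")),  # toy predictor
    train_and_evaluate=lambda cell: cell.count("a") / MAX_BLOCKS,  # toy score
    update_predictor=history.append,  # only records the observations
    expand=lambda cell: [cell + ("a",), cell + ("b",)],
    beam_width=2,
    max_blocks=MAX_BLOCKS,
)
```

With the toy scoring rule, which rewards blocks named "a", the loop runs three rounds and returns the cell `("a", "a", "a")`. In a real system the `update_predictor` callable would retrain the performance prediction neural network rather than merely record the observations.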

This specification uses the term “configured” in connection with systems and computer program components. For a system of one or more computers to be configured to perform particular operations or actions means that the system has installed on it software, firmware, hardware, or a combination of them that in operation cause the system to perform the operations or actions. For one or more computer programs to be configured to perform particular operations or actions means that the one or more programs include instructions that, when executed by data processing apparatus, cause the apparatus to perform the operations or actions.

Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory storage medium for execution by, or to control the operation of, data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. Alternatively or in addition, the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.

The term “data processing apparatus” refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can also be, or further include, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). The apparatus can optionally include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.

A computer program, which may also be referred to or described as a program, software, a software application, an app, a module, a software module, a script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages; and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub-programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a data communication network.

In this specification, the term “database” is used broadly to refer to any collection of data: the data does not need to be structured in any particular way, or structured at all, and it can be stored on storage devices in one or more locations. Thus, for example, the index database can include multiple collections of data, each of which may be organized and accessed differently.

Similarly, in this specification the term “engine” is used broadly to refer to a software-based system, subsystem, or process that is programmed to perform one or more specific functions. Generally, an engine will be implemented as one or more software modules or components, installed on one or more computers in one or more locations. In some cases, one or more computers will be dedicated to a particular engine; in other cases, multiple engines can be installed and running on the same computer or computers.

The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA or an ASIC, or by a combination of special purpose logic circuitry and one or more programmed computers.

Computers suitable for the execution of a computer program can be based on general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. The central processing unit and the memory can be supplemented by, or incorporated in, special purpose logic circuitry. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.

Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.

To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's device in response to requests received from the web browser. Also, a computer can interact with a user by sending text messages or other forms of message to a personal device, e.g., a smartphone that is running a messaging application, and receiving responsive messages from the user in return.

Data processing apparatus for implementing machine learning models can also include, for example, special-purpose hardware accelerator units for processing common and compute-intensive parts of machine learning training or production, i.e., inference, workloads.

Machine learning models can be implemented and deployed using a machine learning framework, e.g., a TensorFlow framework, a Microsoft Cognitive Toolkit framework, an Apache Singa framework, or an Apache MXNet framework.

Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface, a web browser, or an app through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back end, middleware, or front end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (LAN) and a wide area network (WAN), e.g., the Internet.

The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, a server transmits data, e.g., an HTML page, to a user device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the device, which acts as a client. Data generated at the user device, e.g., a result of the user interaction, can be received at the server from the device.

While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or on the scope of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially be claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

Similarly, while operations are depicted in the drawings and recited in the claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.

1. (canceled)
2. A method performed by one or more computers, the method comprising: determining an architecture for a task neural network that is configured to perform a particular machine learning task, comprising: obtaining data specifying a current set of candidate architectures for the task neural network; for each candidate architecture in the current set: generating an input that specifies the candidate architecture; processing the input that specifies the candidate architecture using a performance prediction neural network having a plurality of performance prediction parameters, wherein the performance prediction neural network is configured to process the input specifying the candidate architecture in accordance with current values of the performance prediction parameters to generate a performance prediction for the candidate architecture without training a candidate task neural network having the candidate architecture on the particular machine learning task, the performance prediction characterizing how well a neural network having the candidate architecture would perform after being trained on the particular machine learning task; generating an updated set of candidate architectures by selecting one or more of the candidate architectures in the current set based on the performance predictions for the candidate architectures in the current set; and for each candidate architecture in the updated set, generating an instance of the task neural network having the candidate architecture; training the instance to perform the particular machine learning task; and evaluating a performance of the trained instance on the particular machine learning task to determine an actual performance of the trained instance; and using the actual performances for the trained instances to adjust the current values of the performance prediction parameters of the performance prediction neural network using supervised learning; and for each candidate architecture in the updated set: generating a plurality of new candidate architectures from the candidate architecture by adding, for each new candidate architecture, a respective new operation block having respective hyperparameters to the candidate architecture.
3. The method of claim 2, wherein the particular machine learning task comprises image processing.
4. The method of claim 2, wherein the particular machine learning task comprises image or video classification.
5. The method of claim 2, wherein the particular machine learning task comprises speech recognition.
6. The method of claim 2, further comprising: training a task neural network having the determined architecture; and using the trained task neural network having the determined architecture to perform the particular machine learning task on received network inputs.
7. The method of claim 2, further comprising: for each new candidate architecture: processing data specifying the new candidate architecture using the performance prediction neural network and in accordance with the updated values of the performance prediction parameters to generate a performance prediction for the new candidate architecture; and generating a new set of candidate architectures by selecting one or more of the new candidate architectures based on the performance predictions for the new candidate architectures.
8. The method of claim 7, further comprising: selecting one of the new candidate architectures in the new set as the architecture for the task neural network.
9. The method of claim 8, wherein the selecting comprises: for each new candidate architecture in the new set: generating an instance of the task neural network having the new candidate architecture; training the instance to perform the particular machine learning task; and evaluating a performance of the trained instance on the particular machine learning task to determine an actual performance of the trained instance; and selecting a new candidate architecture corresponding to the trained instance having the best actual performance as the architecture for the task neural network.
10. The method of claim 7, wherein the architecture for the task neural network comprises a plurality of convolutional cells that each share one or more hyperparameters, each of the plurality of convolutional cells comprising one or more operation blocks that each receive one or more respective input hidden states and generate a respective output hidden state, and wherein each candidate architecture and each new candidate architecture defines values for the hyperparameters that are shared by each convolutional cell.
11. The method of claim 10, wherein each candidate architecture defines an architecture for a convolutional cell having a first number of operation blocks.
12. The method of claim 2, wherein the input specifying the candidate architecture is a sequence of embeddings that define the candidate architecture, and wherein the performance prediction neural network is a recurrent neural network.
13. The method of claim 12, wherein the performance prediction is an output of the recurrent neural network after processing a last embedding in the sequence of embeddings.
14. A system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform operations comprising: determining an architecture for a task neural network that is configured to perform a particular machine learning task, comprising: obtaining data specifying a current set of candidate architectures for the task neural network; for each candidate architecture in the current set: generating an input that specifies the candidate architecture; processing the input that specifies the candidate architecture using a performance prediction neural network having a plurality of performance prediction parameters, wherein the performance prediction neural network is configured to process the input specifying the candidate architecture in accordance with current values of the performance prediction parameters to generate a performance prediction for the candidate architecture without training a candidate task neural network having the candidate architecture on the particular machine learning task, the performance prediction characterizing how well a neural network having the candidate architecture would perform after being trained on the particular machine learning task; generating an updated set of candidate architectures by selecting one or more of the candidate architectures in the current set based on the performance predictions for the candidate architectures in the current set; and for each candidate architecture in the updated set, generating an instance of the task neural network having the candidate architecture; training the instance to perform the particular machine learning task; and evaluating a performance of the trained instance on the particular machine learning task to determine an actual performance of the trained instance; and using the actual performances for the trained instances to adjust the current values of the performance prediction parameters of the performance prediction neural network using supervised learning; and for each candidate architecture in the updated set: generating a plurality of new candidate architectures from the candidate architecture by adding, for each new candidate architecture, a respective new operation block having respective hyperparameters to the candidate architecture.
15. The system of claim 14, wherein the operations further comprise: training a task neural network having the determined architecture; and using the trained task neural network having the determined architecture to perform the particular machine learning task on received network inputs.
16. The system of claim 14, wherein the operations further comprise: for each new candidate architecture: processing data specifying the new candidate architecture using the performance prediction neural network and in accordance with the updated values of the performance prediction parameters to generate a performance prediction for the new candidate architecture; and generating a new set of candidate architectures by selecting one or more of the new candidate architectures based on the performance predictions for the new candidate architectures.
17. The system of claim 16, wherein the operations further comprise: selecting one of the new candidate architectures in the new set as the architecture for the task neural network.
18. The system of claim 17, wherein the operations for selecting one of the new candidate architectures in the new set as the architecture for the task neural network comprise: for each new candidate architecture in the new set: generating an instance of the task neural network having the new candidate architecture; training the instance to perform the particular machine learning task; and evaluating a performance of the trained instance on the particular machine learning task to determine an actual performance of the trained instance; and selecting a new candidate architecture corresponding to the trained instance having the best actual performance as the architecture for the task neural network.
19. The system of claim 16, wherein the architecture for the task neural network comprises a plurality of convolutional cells that each share one or more hyperparameters, each of the plurality of convolutional cells comprising one or more operation blocks that each receive one or more respective input hidden states and generate a respective output hidden state, and wherein each candidate architecture and each new candidate architecture defines values for the hyperparameters that are shared by each convolutional cell.
20. The system of claim 19, wherein each candidate architecture defines an architecture for a convolutional cell having a first number of operation blocks.
21. The system of claim 14, wherein the input specifying the candidate architecture is a sequence of embeddings that define the candidate architecture, and wherein the performance prediction neural network is a recurrent neural network.
22. The system of claim 21, wherein the performance prediction is an output of the recurrent neural network after processing a last embedding in the sequence of embeddings.
23. One or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations comprising: determining an architecture for a task neural network that is configured to perform a particular machine learning task, comprising: obtaining data specifying a current set of candidate architectures for the task neural network; for each candidate architecture in the current set: generating an input that specifies the candidate architecture; processing the input that specifies the candidate architecture using a performance prediction neural network having a plurality of performance prediction parameters, wherein the performance prediction neural network is configured to process the input specifying the candidate architecture in accordance with current values of the performance prediction parameters to generate a performance prediction for the candidate architecture without training a candidate task neural network having the candidate architecture on the particular machine learning task, the performance prediction characterizing how well a neural network having the candidate architecture would perform after being trained on the particular machine learning task; generating an updated set of candidate architectures by selecting one or more of the candidate architectures in the current set based on the performance predictions for the candidate architectures in the current set; and for each candidate architecture in the updated set, generating an instance of the task neural network having the candidate architecture; training the instance to perform the particular machine learning task; and evaluating a performance of the trained instance on the particular machine learning task to determine an actual performance of the trained instance; and using the actual performances for the trained instances to adjust the current values of the performance prediction parameters of the performance prediction neural network using supervised learning; and for each candidate architecture in the updated set: generating a plurality of new candidate architectures from the candidate architecture by adding, for each new candidate architecture, a respective new operation block having respective hyperparameters to the candidate architecture.